Code
exped <- read_csv("data/exped_tidy.csv")
peaks <- read_csv("data/peaks_tidy.csv")
colnames(peaks)
head(peaks)
colnames(exped)
head(exped)
table(exped$SUCCESS1, useNA = "ifany")INFO 526 Final Project Proposal
Analysis of Himalayan expeditions from 2020–2024
Trevor Macdonald & Nandakumar Kuthalaraja
School of Information, University of Arizona
High-altitude mountaineering offers a natural laboratory for studying how human, environmental, and logistical factors interact to shape risk and achievement. Using tidy versions of the Himalayan Database, this project leverages R (tidyverse, tidy models) and Quarto to build an interactive, reproducible data-visualisation narrative around two guiding questions:
What factors contribute to the success or failure of a summit?
We knit together expedition, peak, and temporal attributes to explore how season, route choice, peak height, team size, leader nationality, weather windows, and oxygen strategy correlate with summit outcomes. Visual encodings include faceted ridge plots of summit probability across seasons, interactive geographic representations of route popularity versus success.
A multilevel logistic model provides effect-size estimates while accounting for peak-level heterogeneity.
Does an expedition’s funding affect safety outcomes?
Because direct cost metrics are absent, we construct a funding proxy that blends sponsorship type, total members & hired staff, oxygen usage, and expedition duration.
We then relate this composite to safety indicators like member and staff fatalities, high-altitude illnesses, rescue events via scatter-plot matrices, concentration curves, and Poisson regression. Stratified visualizations highlight differential risk patterns between commercial and self-organised teams.
Preliminary findings suggest that
- seasonality and route normalization remain the strongest predictors of summit success
- larger, commercially sponsored teams enjoy slightly higher success but only marginal safety advantages once altitude and weather are controlled
- resource intensive strategies such as extensive oxygen deployment reduce high-altitude illness yet do not eliminate fatal risk.
This project looks at data from many Himalayan mountain climbing trips. We use two main files: one has details about each expedition (called exped_tidy.csv), and the other has information about the mountains themselves (peaks_tidy.csv). The expedition data covers hundreds of trips over many years. For each trip, we know things like which mountain was climbed, how many people were in the team, if they used oxygen, if it was a commercial (guided) trip, and whether the team reached the summit or not.
The mountain (peaks) data gives us extra facts, like each mountain’s name, height, and what region it is in. By putting these two datasets together, we can explore how things like team size, the use of oxygen, type of trip, and which mountain was climbed affect the chances of reaching the summit.
Analyze expedition outcomes using both expedition and peak variables. “Success” will be defined based on the `TERMREASON_FACTOR`, with specific values representing a successful summit.
Use factors like – geography, team size, year, and commercial vs non-commercial expeditions
Climbing a Himalayan peak is a big challenge, and not every team that tries will reach the top. Some expeditions succeed, while others turn back before the summit. But what makes the difference? In this project, we want to understand what factors help climbers succeed and what might make them fail.
We have used following factors:
| Explore Team Size |
| Effect of Geography |
| Explore by Year |
| Explore Commercial Factor |
table(exped$SUCCESS1, useNA = "ifany")
exped2 <- exped |>
mutate(
success = ifelse(SUCCESS1 %in% c("Y", "Yes", "yes"), 1,
ifelse(SUCCESS1 %in% c("N", "No", "no"), 0, as.numeric(SUCCESS1))),
peak = as.factor(PEAKID),
team_size = as.numeric(TOTMEMBERS),
oxygen = ifelse(O2USED %in% c("Y", "Yes", "yes"), 1,
ifelse(O2USED %in% c("N", "No", "no"), 0, as.numeric(O2USED))),
season = as.factor(SEASON)
) |>
filter(!is.na(success), !is.na(team_size), !is.na(oxygen))
model <- glm(
success ~ peak + team_size + oxygen + season + YEAR,
data = exped2,
family = "binomial"
)The Overall Success vs Failure Rate out of all expeditions.
exped2 |>
mutate(outcome = ifelse(success == 1, "Success", "Failure")) |>
group_by(outcome) |>
summarize(n = n()) |>
mutate(rate = n / sum(n)) |>
ggplot(aes(x = outcome, y = rate, fill = outcome)) +
geom_col(width = 0.5) +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = c("Success" = "forestgreen", "Failure" = "firebrick")) +
labs(
title = "Overall Expedition Success vs Failure Rate",
x = "Expedition Outcome",
y = "Proportion"
) +
theme_minimal()We will study with following variables:
| Potential Factors to Study Success Rate |
|---|
| Explore Team Size |
| Effect of Geography |
| Explore by Year |
| Explore Commercial Factor |
ggplot(exped2, aes(x = team_size, fill = factor(success))) +
geom_bar(position = "fill") +
scale_x_continuous(
breaks = seq(0, max(exped2$team_size, na.rm = TRUE), by = 10),
expand = expansion(add = c(0, 0))
) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
x = "Team Size",
y = "Proportion",
fill = "Success\n(1 = yes)",
title = "Proportion of Success vs. Failure\nby Team Size"
) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.title = element_text(hjust = 0.5)
)df_bin <- exped2 |>
group_by(team_size) |>
summarize(rate = mean(success), n = n(), .groups="drop") |>
filter(n > 3)
ggplot(df_bin, aes(x = team_size, y = rate)) +
geom_segment(aes(xend = team_size, y = 0, yend = rate), color = "grey70") +
geom_point(size = 2) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
x = "Team Size",
y = "Success Rate",
title = "Lollipop Chart of Success Rate by Team Size",
caption = "Source: Himalayan Database (2020–2024)"
)These plots suggests a non-linear relationship between team size and summit success:
Very small teams (1–5 climbers) tend to have very high success rates.
Probable cause - agility, lighter logistical needs, and tighter decision-making.
Small-to-medium teams (3–8 climbers) show a noticeable dip in success.
Probable cause - enough complexity to introduce friction but not enough manpower to provide robust support.
Medium-large teams (9–20 climbers) exhibit steadily increasing success rates, peaking around 20 members at or near 100%.
Probable cause - have enough members to spread tasks, establish well-stocked camps without yet suffering the full burden of very large-group logistics.
Very large teams (20+) see a gradual decline in success.
Probable cause - supplies, campsite congestion, decision latency.
peak_success <- exped2 |>
left_join(peaks |> select(PEAKID, PKNAME, HEIGHTM), by = "PEAKID") |>
group_by(PEAKID, PKNAME, HEIGHTM) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 10)
ggplot(peak_success, aes(x = HEIGHTM, y = success_rate, label = PKNAME)) +
geom_point(color = "steelblue", size = 3, alpha = 0.7) +
geom_text(vjust = -0.8, size = 2.5, check_overlap = TRUE) +
labs(
title = "Altitude vs. Success Rate Across Peaks",
x = "Altitude (meters)",
y = "Success Rate"
) + geom_smooth(method = "loess", se = FALSE, color = "darkred")peak_success <- exped2 |>
left_join(peaks |> select(PEAKID, PKNAME, HEIGHTM), by = "PEAKID") |>
group_by(PKNAME, HEIGHTM) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 10)
ggplot(peak_success, aes(x = reorder(PKNAME, HEIGHTM), y = HEIGHTM, color = success_rate, size = n)) +
geom_point(alpha = 0.8) +
scale_color_gradient(low = "red", high = "green") +
labs(
title = "Altitude and Success Rate Across Peaks",
x = "Peak",
y = "Altitude (meters)",
color = "Success Rate",
size = "Number of Expeditions"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1)
)Geography, in particular the altitude band of a peak—turns out not to be a simple “the higher you go, the harder it gets” story.
Low-altitude peaks (~6800 m, e.g. Ama Dablam) enjoy very high success rates, thanks to gentler slopes, shorter approaches and straightforward rope-fixed “ladder” routes
Mid-altitude peaks (~7100–7300 m, like Pumori and Baruntse) dips to the bottom of the curve,mountains are often more technical, less commercialized, with fewer fixed ropes and smaller teams—so even though they’re “lower,” they prove tougher
Higher Peaks (8000 m + like Everest) enjoys very high success rates, as they benefit from massive infrastructure (high-traffic base camps, helicopter support etc.), which more than offsets the physiological challenge of extreme altitude
Overall, a peak’s infrastructure and traffic are at least as critical to success as its height.
yearly_success <- exped2 |>
group_by(YEAR) |>
summarize(
success_rate = mean(success == 1, na.rm = TRUE),
n = n()
) |>
filter(n > 1)
ggplot(yearly_success, aes(x = YEAR, y = success_rate)) +
geom_line(color = "steelblue", size = 1.2) +
geom_point(color = "darkred", size = 2) +
geom_smooth(method = "loess", se = FALSE, color = "forestgreen", linetype = "dashed") +
labs(
title = "Expedition Success Rate by Year (with Trend)",
x = "Year",
y = "Success Rate"
) +
scale_y_continuous(labels = scales::percent) +
theme_minimal()peak_year <- exped2 |>
left_join(peaks |> select(PEAKID, PKNAME), by = "PEAKID") |>
group_by(YEAR, PKNAME) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 3)
# Plot heatmap
ggplot(peak_year, aes(x = YEAR, y = reorder(PKNAME, -success_rate), fill = success_rate)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "red", high = "green") +
labs(
title = "Success Rate by Year and Peak",
x = "Year",
y = "Peak",
fill = "Success Rate"
) +
theme_minimal()These plots suggests, despite a pandemic-driven slump in 2021, overall expedition success has rebounded strongly by 2024
2020 average success rate ~ 72%
Probable cause – pre-pandemic normalcy, full staffing, established logistics.
2021-2022 average success rate fell to ~ 63%
Probable cause – COVID travel restrictions & teams adapted to new health protocols and smaller expedition sizes.
2023-2024 average success rate rebound to ~ 80%
Probable cause – full operational recovery, improved high-altitude gear, accumulated guiding experience.
exped2 <- exped2 |>
mutate(
commercial = case_when(
COMRTE == TRUE ~ "Commercial",
COMRTE == FALSE ~ "Non-Commercial",
TRUE ~ NA_character_
)
)
comm_year <- exped2 |>
filter(!is.na(commercial)) |>
group_by(YEAR, commercial) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 3)
ggplot(comm_year, aes(x = YEAR, y = success_rate, color = commercial)) +
geom_line(size = 1.1) +
geom_point() +
scale_color_manual(values = c("Commercial" = "skyblue", "Non-Commercial" = "gray40")) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Success Rate by Year: Commercial vs Non-Commercial",
x = "Year",
y = "Success Rate",
color = "Expedition Type"
) +
theme_minimal()exped2 |>
filter(!is.na(commercial)) |>
left_join(peaks |> select(PEAKID, PKNAME), by = "PEAKID") |>
group_by(PKNAME, commercial) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 5) |>
ggplot(aes(x = reorder(PKNAME, -success_rate), y = success_rate, fill = commercial)) +
geom_col(position = "dodge") +
coord_flip() +
scale_fill_manual(values = c("Commercial" = "skyblue", "Non-Commercial" = "gray70")) +
labs(
title = "Success Rate by Peak and Expedition Type",
x = "Peak",
y = "Success Rate",
fill = "Type"
) +
theme_minimal()exped2 |>
filter(!is.na(commercial)) |>
group_by(team_size = TOTMEMBERS, commercial) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 3, !is.na(team_size)) |>
ggplot(aes(x = team_size, y = success_rate, color = commercial)) +
geom_point(size = 2, alpha = 0.7) +
geom_smooth(se = FALSE) +
scale_color_manual(values = c("Commercial" = "skyblue", "Non-Commercial" = "gray70")) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Team Size vs Success Rate by Expedition Type",
x = "Team Size",
y = "Success Rate",
color = "Expedition Type"
) +
theme_minimal()These plots suggests, A professional, well-resourced support structure not only accelerates recovery from external shocks (like a pandemic) but also delivers uniformly higher success rates—regardless of mountain or team size.
Commercial expeditions rebound faster and to a higher plateau
In 2021, both types dipped (COVID impact), but commercial climbs dropped only to ~67%, whereas non-commercial fell to ~53%. By 2024, commercial success surged to ~95 %, while non-commercial languished around ~64%.
Across individual peaks, company-run trips consistently outshine independent ones
On major peaks (Everest, Lhotse, Ama Dablam), commercial success rates hover in the 85–95 % range. Non-commercial attempts on the same mountains are 5–10 % lower.
Team-size effects differ by expedition type
Commercial: success stays high (75–100 %) across all team sizes, peaking around 20 members.
Non-commercial: very small groups (> 80 %) and very large groups (~75 %) do better, but mid-sized teams (5–10 climbers) dip down near 50 %
The analysis will be very similar for this question. I suspect there will be some overlap, funding and success rate etc. We will use many of the same variables, but more specifically, FUNDING_TYPE.
We will Compare specific funding types like commercial vs. independent expeditions across countries for more insights
NOTE: Add “real world” examples of expedition success or failure to make analysis relatable.
# Lean tibble of variables of interest
exped_q2 <- exped |>
select(
YEAR, # Year of the expedition
SPONSOR, # Sponsor or funding organization
SUCCESS1, # Success on primary route
SUCCESS2, # Success on second route
SUCCESS3, # Success on third route
SUCCESS4, # Success on fourth route
TERMDATE, # Date expedition was terminated
TERMREASON, # Numeric code for termination reason
TERMREASON_FACTOR, # Description of termination reason
TOTMEMBERS, # Total members in the team
MDEATHS, # Number of member deaths
HDEATHS, # Hired staff deaths
OTHERSMTS, # Other summits during expedition
ACCIDENTS, # Description of accidents
AGENCY, # Trekking/expedition agency
O2USED #
) |>
mutate(
ANY_SUCCESS = SUCCESS1 | SUCCESS2 | SUCCESS3 | SUCCESS4,
FATALITIES = (MDEATHS + HDEATHS > 0),
OUTCOME_LABEL = case_when(
ANY_SUCCESS & !FATALITIES ~ "Success, No Deaths",
ANY_SUCCESS & FATALITIES ~ "Success + Deaths",
!ANY_SUCCESS & FATALITIES ~ "Failure + Deaths",
TRUE ~ "Failure, No Deaths"
)
) |>
select(-SUCCESS1, -SUCCESS2, -SUCCESS3, -SUCCESS4, -MDEATHS, -HDEATHS)
exped_q2 <- exped_q2 |>
mutate(
TERMINATION_TYPE = case_when(
TERMREASON %in% c(1, 2, 3) ~ "Successful",
TERMREASON %in% c(4, 5) ~ "Bad Weather/Conditions",
TERMREASON %in% c(6, 7) ~ "Accident/Illness/Exhaustion",
TERMREASON %in% c(8, 9, 10) ~ "Logistics/Technical",
TERMREASON %in% c(11, 12, 13) ~ "No Attempt/Base Only",
TERMREASON == 14 ~ "Other",
TRUE ~ "Unknown"
)
)library(stringr)
# Source: https://stackoverflow.com/questions/59082243/multiple-patterns-for-string-matching-using-case-when
sponsor_prefix_tbl <- exped_q2 |>
mutate(
SPONSOR_CLEAN = str_to_lower(str_trim(SPONSOR)),
SPONSOR_CLEAN = str_remove(SPONSOR_CLEAN, "\\s*\\b20\\d{2}\\b$"),
SPONSOR_PREFIX = str_extract(SPONSOR_CLEAN, "^\\w+\\s*\\w*")
) |>
count(SPONSOR_PREFIX, sort = TRUE)
#Sponsors <- exped_q2 |>
#distinct(SPONSOR_CLEAN) #|>
#count()
exped_q2 <- exped_q2 |>
mutate(
SPONSOR_CLEAN = str_to_lower(str_trim(SPONSOR)),
SPONSOR_CLEAN = str_remove(SPONSOR_CLEAN, "\\s*\\b20\\d{2}\\b$"),
SPONSOR_PREFIX = str_extract(SPONSOR_CLEAN, "^\\w+\\s*\\w*"),
FUNDING_TYPE = case_when(
str_detect(SPONSOR_PREFIX, "army|police|military|defense") ~ "Military",
str_detect(SPONSOR_PREFIX, "university|college|school|club|jac|univ") ~ "Academic/Alpine Club",
str_detect(SPONSOR_PREFIX, "guides|treks|adventure|travel|ascent|trip|climbalaya|madison|imagine|kobler|makalu|highland|satori|elite|glacier|dream|thamserku") ~ "Commercial",
str_detect(SPONSOR_PREFIX, "private|individual|self|solo|1st|friends|father|jon|jost|alex|kilian|marc|kishori|soren") ~ "Private/Individual",
str_detect(SPONSOR_PREFIX, "national geographic|japanese|korean|slovakian|french|chinese|indian|nepalese|russian|german|austrian|canadian|italian|spanish|swiss|american|latvian") ~ "National Program",
is.na(SPONSOR_PREFIX) | SPONSOR_PREFIX == "" ~ "Other/Unknown",
TRUE ~ "Other/Unknown"
)
)|>
mutate(
FUNDING_SIMPLIFIED = case_when(
FUNDING_TYPE == "Commercial" ~ "Commercial",
FUNDING_TYPE %in% c("Academic/Alpine Club", "National Program", "Military") ~ "Non-commercial",
FUNDING_TYPE %in% c("Private/Individual", "Other/Unknown") ~ "Independent/Other"
)
)
#glimpse(exped_q2)Commercial expeditions are often seen as safer due to better logistics, guides, and resources.
The following analysis aims to address this assumption by comparing commercial, private, and non-commercial groups.
To determine funding source the data was wrangled and sorted using the “SPONSOR” column. This was accomplished by repeatedly editing the sorted list. The sponsor column were nearly all unique. I found an example here:https://stackoverflow.com/questions/59082243/multiple-patterns-for-string-matching-using-case-when
# Summarize safety metrics by funding type
library(dplyr)
library(knitr)
exped_q2 |>
group_by(FUNDING_SIMPLIFIED) |>
summarise(
n_expeditions = n(),
n_success = sum(ANY_SUCCESS),
n_fatalities = sum(FATALITIES),
pct_success = mean(ANY_SUCCESS),
pct_fatal = mean(FATALITIES),
.groups = "drop"
) |>
mutate(
pct_success = scales::percent(pct_success, accuracy = 0.1),
pct_fatal = scales::percent(pct_fatal, accuracy = 0.1)
) |>
kable(
caption = "Expedition Outcomes by Funding Type (2020–2024)",
col.names = c(
"Funding Type",
"Total Expeditions",
"Number Successful",
"Number with Fatalities",
"Success Rate",
"Fatality Rate"
),
format = "html"
)Let’s Start the plotting
These figures shows that proportionally, accidents are more common with None-commercial expeditions. Termination due to bad weather appears uniform across all groups, which is expected.
ggplot(exped_q2, aes(
x = FUNDING_SIMPLIFIED,
fill = TERMINATION_TYPE
)) +
geom_bar(position = "fill") +
scale_fill_brewer(
palette = "YlGnBu",
direction = 1,
name = "Termination reason"
) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
x = NULL,
y = NULL,
fill = "Termination reason",
title = "How do expeditions end?",
subtitle = "By funding type, Himalayan expeditions 2020–2024",
caption = "Source: Himalayan Database via TidyTuesday (2025-01-21)"
) +
theme_minimal(base_size = 14) +
theme(
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(margin = margin(b = 6), size = 8),
plot.caption = element_text(color = "gray40", size = 6),
legend.position = "right",
panel.grid.major.x = element_blank()
)ggplot(exped_q2, aes(x = FUNDING_SIMPLIFIED, fill = OUTCOME_LABEL)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(palette = "YlGnBu", direction = 1) +
labs(
title = "Expedition Outcomes by Funding Source",
subtitle = "Independent groups have more fatal or failed climbs",
x = NULL,
y = "Proportion of Expeditions",
fill = "Outcome"
)ggplot(exped_q2, aes(x = FUNDING_SIMPLIFIED, fill = OUTCOME_LABEL)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "YlGnBu", direction = 1) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
x = NULL,
y = "Proportion of expeditions",
title = "Distribution of Expedition Outcomes by Funding Type",
subtitle = "Includes success and death combinations for Himalayan expeditions (2020–2024)",
fill = "Outcome",
caption = "Source: Himalayan Database via TidyTuesday (2025-01-21)"
) +
theme_minimal(base_size = 14) +
theme(
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
legend.position = "right",
panel.grid.major.x = element_blank()
)exped_q2 |>
group_by(YEAR, FUNDING_SIMPLIFIED) |>
summarise(fatality_rate = mean(FATALITIES, na.rm = TRUE), .groups = "drop") |>
ggplot(aes(x = YEAR, y = fatality_rate, color = FUNDING_SIMPLIFIED)) +
geom_line(size = 1.2) +
geom_smooth(method = "loess", se = FALSE, linewidth = 1, linetype = "dashed") +
scale_color_brewer(palette = "YlGnBu", direction = 1) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Trend in Member Fatalities by Funding Type",
subtitle = "Fatality rate = expeditions with ≥1 member or hired death",
x = "Year",
y = "Fatality Rate (%)",
color = "Funding Type",
caption = "Source: Himalayan Database via TidyTuesday (2025-01-21)"
) +
theme_minimal(base_size = 14)It’s one thing to say commercial expeditions are safer but is that always true? I wanted to check whether fatality rates have changed over time, and if so, whether some groups are improving faster than others. This smoothed line chart shows the share of expeditions with fatalities by year and funding type. This is overlap with question 1.
Next question: does team size make a difference? You’d think larger teams might be safer due to support and redundancy, but also possibly riskier if they’re less experienced or move slowly. This scatterplot checks whether total team size correlates with success, failure, or fatalities — and whether that varies by funding type.
ggplot(exped_q2, aes(x = TOTMEMBERS, y = OUTCOME_LABEL, color = FUNDING_SIMPLIFIED)) +
geom_jitter(width = 0.3, height = 0.2, alpha = 0.6, size = 2) +
scale_color_brewer(palette = "YlGnBu", direction = 1) +
labs(
x = "Total team members",
y = "Outcome type",
title = "Team Size and Expedition Outcome by Funding Type",
subtitle = "Each point is one expedition (2020–2024)",
color = "Funding Type",
caption = "Source: Himalayan Database via TidyTuesday (2025-01-21)"
) +
theme_minimal(base_size = 14)Large commercial teams often come with more resources: guides, oxygen, fixed ropes, medical support. That could partly explain why they have lower fatality rates.
Meanwhile, small non-commercial teams might:
Take harder routes (fewer sherpa or porters)
Push limits with less redundancy
Face more exposure if things go wrong
So while team size alone doesn’t determine safety, it’s a proxy for resourcing — and resourcing links back to safety.
library(ggridges)
exped_q2 |>
filter(TOTMEMBERS < 50) |> # Trim extreme outliers for better scaling
ggplot(aes(
x = TOTMEMBERS,
y = FUNDING_SIMPLIFIED,
fill = FUNDING_SIMPLIFIED
)) +
geom_density_ridges(
scale = 1.2,
rel_min_height = 0.01,
alpha = 0.6,
color = "white"
) +
scale_fill_brewer(palette = "YlGnBu", direction = 1) +
labs(
x = "Team Size (Total Members)",
y = "Funding Type",
title = "Distribution of Team Size by Funding Type",
subtitle = "Commercial expeditions tend to bring more people",
caption = "Source: Himalayan Database via TidyTuesday (2025-01-21)"
) +
theme_minimal(base_size = 14) +
theme(legend.position = "none")---
title: "A Statistical Exploration of Himalayan Expeditions"
subtitle: "INFO 526 Final Project Proposal"
author:
- name: "Trevor Macdonald & Nandakumar Kuthalaraja"
affiliations:
- name: "School of Information, University of Arizona"
description: >
Analysis of Himalayan expeditions from 2020–2024
format:
html:
code-tools: true
code-overflow: wrap
embed-resources: true
editor: visual
execute:
warning: false
echo: false
---
```{r}
#| label: load-packages
#| include: false
#| echo: false
# Load packages here
pacman::p_load(tidymodels,
tidyverse, patchwork, scales, ggplot2)
```
```{r}
#| label: setup
#| include: false
#| echo: false
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 11))
knitr::opts_chunk$set(
fig.retina = 3,
dpi = 300,
fig.width = 6,
fig.asp = 0.618
)
```
## Abstract
High-altitude mountaineering offers a natural laboratory for studying how human, environmental, and logistical factors interact to shape risk and achievement. Using tidy versions of the Himalayan Database, this project leverages R (tidyverse, tidy models) and Quarto to build an interactive, reproducible data-visualisation narrative around two guiding questions:
- **What factors contribute to the success or failure of a summit?**
- We knit together expedition, peak, and temporal attributes to explore how season, route choice, peak height, team size, leader nationality, weather windows, and oxygen strategy correlate with summit outcomes. Visual encodings include faceted ridge plots of summit probability across seasons, interactive geographic representations of route popularity versus success.
- A multilevel logistic model provides effect-size estimates while accounting for peak-level heterogeneity.
- **Does an expedition’s funding affect safety outcomes?**
- Because direct cost metrics are absent, we construct a funding proxy that blends sponsorship type, total members & hired staff, oxygen usage, and expedition duration.
- We then relate this composite to safety indicators like member and staff fatalities, high-altitude illnesses, rescue events via scatter-plot matrices, concentration curves, and Poisson regression. Stratified visualizations highlight differential risk patterns between commercial and self-organised teams.
Preliminary findings suggest that
\- seasonality and route normalization remain the strongest predictors of summit success
\- larger, commercially sponsored teams enjoy slightly higher success but only marginal safety advantages once altitude and weather are controlled
\- resource intensive strategies such as extensive oxygen deployment reduce high-altitude illness yet do not eliminate fatal risk.
## Introduction
This project looks at data from many Himalayan mountain climbing trips. We use two main files: one has details about each expedition (called `exped_tidy.csv`), and the other has information about the mountains themselves (`peaks_tidy.csv`). The expedition data covers hundreds of trips over many years. For each trip, we know things like which mountain was climbed, how many people were in the team, if they used oxygen, if it was a commercial (guided) trip, and whether the team reached the summit or not.
The mountain (peaks) data gives us extra facts, like each mountain’s name, height, and what region it is in. By putting these two datasets together, we can explore how things like team size, the use of oxygen, type of trip, and which mountain was climbed affect the chances of reaching the summit.
## Question 1: What factors contribute to the success or failure of a summit?
- Analyze expedition outcomes using both expedition and peak variables. "Success" will be defined based on the \`TERMREASON_FACTOR\`, with specific values representing a successful summit.
- Use factors like -- geography, team size, year, and commercial vs non-commercial expeditions
### Introduction
Climbing a Himalayan peak is a big challenge, and not every team that tries will reach the top. Some expeditions succeed, while others turn back before the summit. But what makes the difference? In this project, we want to understand what factors help climbers succeed and what might make them fail.
We have used following factors:
| |
|---------------------------|
| Explore Team Size |
| Effect of Geography |
| Explore by Year |
| Explore Commercial Factor |
### Approach
- Get Data from the CSV files
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
exped <- read_csv("data/exped_tidy.csv")
peaks <- read_csv("data/peaks_tidy.csv")
colnames(peaks)
head(peaks)
colnames(exped)
head(exped)
table(exped$SUCCESS1, useNA = "ifany")
```
- Build Model & Variable Inferences
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
table(exped$SUCCESS1, useNA = "ifany")
exped2 <- exped |>
mutate(
success = ifelse(SUCCESS1 %in% c("Y", "Yes", "yes"), 1,
ifelse(SUCCESS1 %in% c("N", "No", "no"), 0, as.numeric(SUCCESS1))),
peak = as.factor(PEAKID),
team_size = as.numeric(TOTMEMBERS),
oxygen = ifelse(O2USED %in% c("Y", "Yes", "yes"), 1,
ifelse(O2USED %in% c("N", "No", "no"), 0, as.numeric(O2USED))),
season = as.factor(SEASON)
) |>
filter(!is.na(success), !is.na(team_size), !is.na(oxygen))
model <- glm(
success ~ peak + team_size + oxygen + season + YEAR,
data = exped2,
family = "binomial"
)
```
- Start plotting various charts building up to the case
### Start
::::: columns
::: {.column width="55%"}
The Overall Success vs Failure Rate out of all expeditions.
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
exped2 |>
mutate(outcome = ifelse(success == 1, "Success", "Failure")) |>
group_by(outcome) |>
summarize(n = n()) |>
mutate(rate = n / sum(n)) |>
ggplot(aes(x = outcome, y = rate, fill = outcome)) +
geom_col(width = 0.5) +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = c("Success" = "forestgreen", "Failure" = "firebrick")) +
labs(
title = "Overall Expedition Success vs Failure Rate",
x = "Expedition Outcome",
y = "Proportion"
) +
theme_minimal()
```
:::
::: {.column width="45%"}
We will study with following variables:
| Potential Factors to Study Success Rate |
|-----------------------------------------|
| Explore Team Size |
| Effect of Geography |
| Explore by Year |
| Explore Commercial Factor |
:::
:::::
### Explore Team Size
::::: columns
::: {.column width="50%"}
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
#|
ggplot(exped2, aes(team_size)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
facet_wrap(~ success, labeller = labeller(success = c(`0` = "Failure", `1` = "Success"))) +
labs(
x = "Team Size",
y = "Count",
title = "Distribution of Team Sizes\nfor Success vs. Failure"
)
```
:::
::: {.column width="50%"}
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
ggplot(exped2, aes(x = team_size, fill = factor(success))) +
geom_bar(position = "fill") +
scale_x_continuous(
breaks = seq(0, max(exped2$team_size, na.rm = TRUE), by = 10),
expand = expansion(add = c(0, 0))
) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
x = "Team Size",
y = "Proportion",
fill = "Success\n(1 = yes)",
title = "Proportion of Success vs. Failure\nby Team Size"
) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.title = element_text(hjust = 0.5)
)
```
:::
:::::
:::::: columns
::: {.column width="25%"}
:::
::: {.column width="50%" style="\"min-height:100px;"}
```{r, fig.height=3}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
df_bin <- exped2 |>
group_by(team_size) |>
summarize(rate = mean(success), n = n(), .groups="drop") |>
filter(n > 3)
ggplot(df_bin, aes(x = team_size, y = rate)) +
geom_segment(aes(xend = team_size, y = 0, yend = rate), color = "grey70") +
geom_point(size = 2) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
x = "Team Size",
y = "Success Rate",
title = "Lollipop Chart of Success Rate by Team Size",
caption = "Source: Himalayan Database (2020–2024)"
)
```
:::
::: {.column width="25%"}
<!-- blank right spacer -->
:::
::::::
### Explore Team Size Analysis
::: {style="font-size:0.8em; line-height:1.2;"}
`These plots suggests a non-linear relationship between team size and summit success:`
- **`Very small teams (1–5 climbers)`** tend to have very high success rates.
Probable cause - agility, lighter logistical needs, and tighter decision-making.
- **`Small-to-medium teams (3–8 climbers)`** show a noticeable dip in success.
Probable cause - enough complexity to introduce friction but not enough manpower to provide robust support.
- **`Medium-large teams (9–20 climbers)`** exhibit steadily increasing success rates, peaking around 20 members at or near 100%.
Probable cause - have enough members to spread tasks, establish well-stocked camps without yet suffering the full burden of very large-group logistics.
- **`Very large teams (20+)`** see a gradual decline in success.
Probable cause - supplies, campsite congestion, decision latency.
:::
### Effect of Geography
::::: columns
::: {.column width="50%"}
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
peak_success <- exped2 |>
left_join(peaks |> select(PEAKID, PKNAME, HEIGHTM), by = "PEAKID") |>
group_by(PEAKID, PKNAME, HEIGHTM) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 10)
ggplot(peak_success, aes(x = HEIGHTM, y = success_rate, label = PKNAME)) +
geom_point(color = "steelblue", size = 3, alpha = 0.7) +
geom_text(vjust = -0.8, size = 2.5, check_overlap = TRUE) +
labs(
title = "Altitude vs. Success Rate Across Peaks",
x = "Altitude (meters)",
y = "Success Rate"
) + geom_smooth(method = "loess", se = FALSE, color = "darkred")
```
:::
::: {.column width="50%"}
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
peak_success <- exped2 |>
left_join(peaks |> select(PEAKID, PKNAME, HEIGHTM), by = "PEAKID") |>
group_by(PKNAME, HEIGHTM) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 10)
ggplot(peak_success, aes(x = reorder(PKNAME, HEIGHTM), y = HEIGHTM, color = success_rate, size = n)) +
geom_point(alpha = 0.8) +
scale_color_gradient(low = "red", high = "green") +
labs(
title = "Altitude and Success Rate Across Peaks",
x = "Peak",
y = "Altitude (meters)",
color = "Success Rate",
size = "Number of Expeditions"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1)
)
```
:::
:::::
:::::: columns
::: {.column width="25%"}
:::
::: {.column width="50%" style="\"min-height:60px;"}
```{r, fig.height=3}
```
:::
::: {.column width="25%"}
<!-- blank right spacer -->
:::
::::::
### Effect of Geography Analysis
::: {style="font-size:0.8em; line-height:1.2;"}
`Geography, in particular the altitude band of a peak—turns out not to be a simple “the higher you go, the harder it gets” story.`
- **`Low-altitude peaks (~6800 m, e.g. Ama Dablam)`** enjoy very high success rates, thanks to gentler slopes, shorter approaches and straightforward rope-fixed “ladder” routes
- **`Mid-altitude peaks (~7100–7300 m, like Pumori and Baruntse)`** dips to the bottom of the curve,mountains are often more technical, less commercialized, with fewer fixed ropes and smaller teams—so even though they’re “lower,” they prove tougher
- **`Higher Peaks (8000 m + like Everest)`** enjoys very high success rates, as they benefit from massive infrastructure (high-traffic base camps, helicopter support etc.), which more than offsets the physiological challenge of extreme altitude
Overall, a peak’s infrastructure and traffic are at least as critical to success as its height.
:::
### Explore by Year
::::: columns
::: {.column width="50%"}
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
yearly_success <- exped2 |>
group_by(YEAR) |>
summarize(
success_rate = mean(success == 1, na.rm = TRUE),
n = n()
) |>
filter(n > 1)
ggplot(yearly_success, aes(x = YEAR, y = success_rate)) +
geom_line(color = "steelblue", size = 1.2) +
geom_point(color = "darkred", size = 2) +
geom_smooth(method = "loess", se = FALSE, color = "forestgreen", linetype = "dashed") +
labs(
title = "Expedition Success Rate by Year (with Trend)",
x = "Year",
y = "Success Rate"
) +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
```
:::
::: {.column width="50%"}
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
peak_year <- exped2 |>
left_join(peaks |> select(PEAKID, PKNAME), by = "PEAKID") |>
group_by(YEAR, PKNAME) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 3)
# Plot heatmap
ggplot(peak_year, aes(x = YEAR, y = reorder(PKNAME, -success_rate), fill = success_rate)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "red", high = "green") +
labs(
title = "Success Rate by Year and Peak",
x = "Year",
y = "Peak",
fill = "Success Rate"
) +
theme_minimal()
```
:::
:::::
:::::: columns
::: {.column width="25%"}
:::
::: {.column width="50%" style="\"min-height:100px;"}
```{r, fig.height=3}
```
:::
::: {.column width="25%"}
<!-- blank right spacer -->
:::
::::::
### Explore by Year Analysis
::: {style="font-size:0.8em; line-height:1.2;"}
`These plots suggests, despite a pandemic-driven slump in 2021, overall expedition success has rebounded strongly by 2024`
- **`2020`** average success rate \~ 72%
Probable cause – pre-pandemic normalcy, full staffing, established logistics.
- **`2021-2022`** average success rate fell to \~ 63%
Probable cause – COVID travel restrictions & teams adapted to new health protocols and smaller expedition sizes.
- **`2023-2024`** average success rate rebound to \~ 80%
Probable cause – full operational recovery, improved high-altitude gear, accumulated guiding experience.
:::
### Commercial Factor
::::: columns
::: {.column width="50%"}
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
exped2 <- exped2 |>
mutate(
commercial = case_when(
COMRTE == TRUE ~ "Commercial",
COMRTE == FALSE ~ "Non-Commercial",
TRUE ~ NA_character_
)
)
comm_year <- exped2 |>
filter(!is.na(commercial)) |>
group_by(YEAR, commercial) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 3)
ggplot(comm_year, aes(x = YEAR, y = success_rate, color = commercial)) +
geom_line(size = 1.1) +
geom_point() +
scale_color_manual(values = c("Commercial" = "skyblue", "Non-Commercial" = "gray40")) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Success Rate by Year: Commercial vs Non-Commercial",
x = "Year",
y = "Success Rate",
color = "Expedition Type"
) +
theme_minimal()
```
:::
::: {.column width="50%"}
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
exped2 |>
filter(!is.na(commercial)) |>
left_join(peaks |> select(PEAKID, PKNAME), by = "PEAKID") |>
group_by(PKNAME, commercial) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 5) |>
ggplot(aes(x = reorder(PKNAME, -success_rate), y = success_rate, fill = commercial)) +
geom_col(position = "dodge") +
coord_flip() +
scale_fill_manual(values = c("Commercial" = "skyblue", "Non-Commercial" = "gray70")) +
labs(
title = "Success Rate by Peak and Expedition Type",
x = "Peak",
y = "Success Rate",
fill = "Type"
) +
theme_minimal()
```
:::
:::::
:::::: columns
::: {.column width="25%"}
:::
::: {.column width="50%" style="\"min-height:100px;"}
```{r, fig.height=3}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
exped2 |>
filter(!is.na(commercial)) |>
group_by(team_size = TOTMEMBERS, commercial) |>
summarize(success_rate = mean(success == 1, na.rm = TRUE), n = n()) |>
filter(n > 3, !is.na(team_size)) |>
ggplot(aes(x = team_size, y = success_rate, color = commercial)) +
geom_point(size = 2, alpha = 0.7) +
geom_smooth(se = FALSE) +
scale_color_manual(values = c("Commercial" = "skyblue", "Non-Commercial" = "gray70")) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Team Size vs Success Rate by Expedition Type",
x = "Team Size",
y = "Success Rate",
color = "Expedition Type"
) +
theme_minimal()
```
:::
::: {.column width="25%"}
<!-- blank right spacer -->
:::
::::::
### Commercial Factor Analysis
::: {style="font-size:0.8em; line-height:1.2;"}
`These plots suggests, A professional, well-resourced support structure not only accelerates recovery from external shocks (like a pandemic) but also delivers uniformly higher success rates—regardless of mountain or team size.`
- **`Commercial expeditions rebound faster and to a higher plateau`**
In 2021, both types dipped (COVID impact), but commercial climbs dropped only to \~67%, whereas non-commercial fell to \~53%. By 2024, commercial success surged to \~95 %, while non-commercial languished around \~64%.
- **`Across individual peaks, company-run trips consistently outshine independent ones`**
On major peaks (Everest, Lhotse, Ama Dablam), commercial success rates hover in the 85–95 % range. Non-commercial attempts on the same mountains are 5–10 % lower.
- **`Team-size effects differ by expedition type`**
**Commercial**: success stays high (75–100 %) across all team sizes, peaking around 20 members.\
**Non-commercial**: very small groups (\> 80 %) and very large groups (\~75 %) do better, but mid-sized teams (5–10 climbers) dip down near 50 %
:::
## Question 2
The analysis will be very similar for this question. I suspect there will be some overlap, funding and success rate etc. We will use many of the same variables, but more specifically, FUNDING_TYPE.
We will Compare specific funding types like commercial vs. independent expeditions across countries for more insights
NOTE: Add "real world" examples of expedition success or failure to make analysis relatable.
### Approach
- Get Data from the CSV files
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
exped <- read_csv("data/exped_tidy.csv")
```
- Data Wrangling & Variable Inferences
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
# Lean tibble of variables of interest
exped_q2 <- exped |>
select(
YEAR, # Year of the expedition
SPONSOR, # Sponsor or funding organization
SUCCESS1, # Success on primary route
SUCCESS2, # Success on second route
SUCCESS3, # Success on third route
SUCCESS4, # Success on fourth route
TERMDATE, # Date expedition was terminated
TERMREASON, # Numeric code for termination reason
TERMREASON_FACTOR, # Description of termination reason
TOTMEMBERS, # Total members in the team
MDEATHS, # Number of member deaths
HDEATHS, # Hired staff deaths
OTHERSMTS, # Other summits during expedition
ACCIDENTS, # Description of accidents
AGENCY, # Trekking/expedition agency
O2USED #
) |>
mutate(
ANY_SUCCESS = SUCCESS1 | SUCCESS2 | SUCCESS3 | SUCCESS4,
FATALITIES = (MDEATHS + HDEATHS > 0),
OUTCOME_LABEL = case_when(
ANY_SUCCESS & !FATALITIES ~ "Success, No Deaths",
ANY_SUCCESS & FATALITIES ~ "Success + Deaths",
!ANY_SUCCESS & FATALITIES ~ "Failure + Deaths",
TRUE ~ "Failure, No Deaths"
)
) |>
select(-SUCCESS1, -SUCCESS2, -SUCCESS3, -SUCCESS4, -MDEATHS, -HDEATHS)
exped_q2 <- exped_q2 |>
mutate(
TERMINATION_TYPE = case_when(
TERMREASON %in% c(1, 2, 3) ~ "Successful",
TERMREASON %in% c(4, 5) ~ "Bad Weather/Conditions",
TERMREASON %in% c(6, 7) ~ "Accident/Illness/Exhaustion",
TERMREASON %in% c(8, 9, 10) ~ "Logistics/Technical",
TERMREASON %in% c(11, 12, 13) ~ "No Attempt/Base Only",
TERMREASON == 14 ~ "Other",
TRUE ~ "Unknown"
)
)
```
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
library(stringr)
# Source: https://stackoverflow.com/questions/59082243/multiple-patterns-for-string-matching-using-case-when
sponsor_prefix_tbl <- exped_q2 |>
mutate(
SPONSOR_CLEAN = str_to_lower(str_trim(SPONSOR)),
SPONSOR_CLEAN = str_remove(SPONSOR_CLEAN, "\\s*\\b20\\d{2}\\b$"),
SPONSOR_PREFIX = str_extract(SPONSOR_CLEAN, "^\\w+\\s*\\w*")
) |>
count(SPONSOR_PREFIX, sort = TRUE)
#Sponsors <- exped_q2 |>
#distinct(SPONSOR_CLEAN) #|>
#count()
exped_q2 <- exped_q2 |>
mutate(
SPONSOR_CLEAN = str_to_lower(str_trim(SPONSOR)),
SPONSOR_CLEAN = str_remove(SPONSOR_CLEAN, "\\s*\\b20\\d{2}\\b$"),
SPONSOR_PREFIX = str_extract(SPONSOR_CLEAN, "^\\w+\\s*\\w*"),
FUNDING_TYPE = case_when(
str_detect(SPONSOR_PREFIX, "army|police|military|defense") ~ "Military",
str_detect(SPONSOR_PREFIX, "university|college|school|club|jac|univ") ~ "Academic/Alpine Club",
str_detect(SPONSOR_PREFIX, "guides|treks|adventure|travel|ascent|trip|climbalaya|madison|imagine|kobler|makalu|highland|satori|elite|glacier|dream|thamserku") ~ "Commercial",
str_detect(SPONSOR_PREFIX, "private|individual|self|solo|1st|friends|father|jon|jost|alex|kilian|marc|kishori|soren") ~ "Private/Individual",
str_detect(SPONSOR_PREFIX, "national geographic|japanese|korean|slovakian|french|chinese|indian|nepalese|russian|german|austrian|canadian|italian|spanish|swiss|american|latvian") ~ "National Program",
is.na(SPONSOR_PREFIX) | SPONSOR_PREFIX == "" ~ "Other/Unknown",
TRUE ~ "Other/Unknown"
)
)|>
mutate(
FUNDING_SIMPLIFIED = case_when(
FUNDING_TYPE == "Commercial" ~ "Commercial",
FUNDING_TYPE %in% c("Academic/Alpine Club", "National Program", "Military") ~ "Non-commercial",
FUNDING_TYPE %in% c("Private/Individual", "Other/Unknown") ~ "Independent/Other"
)
)
#glimpse(exped_q2)
```
### Motivation & Setup
#### Does Funding Type Affect Expedition Safety?
##### Purpose
Commercial expeditions are often seen as safer due to better logistics, guides, and resources.
The following analysis aims to address this assumption by comparing commercial, private, and non-commercial groups.
##### How Sponsors Were Categorized
To determine funding source the data was wrangled and sorted using the "SPONSOR" column. This was accomplished by repeatedly editing the sorted list. The sponsor column were nearly all unique. I found an example here:https://stackoverflow.com/questions/59082243/multiple-patterns-for-string-matching-using-case-when
#### Summary Statistics on Safety
```{r}
#| label: safety-summary-table
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
# Summarize safety metrics by funding type
library(dplyr)
library(knitr)
exped_q2 |>
group_by(FUNDING_SIMPLIFIED) |>
summarise(
n_expeditions = n(),
n_success = sum(ANY_SUCCESS),
n_fatalities = sum(FATALITIES),
pct_success = mean(ANY_SUCCESS),
pct_fatal = mean(FATALITIES),
.groups = "drop"
) |>
mutate(
pct_success = scales::percent(pct_success, accuracy = 0.1),
pct_fatal = scales::percent(pct_fatal, accuracy = 0.1)
) |>
kable(
caption = "Expedition Outcomes by Funding Type (2020–2024)",
col.names = c(
"Funding Type",
"Total Expeditions",
"Number Successful",
"Number with Fatalities",
"Success Rate",
"Fatality Rate"
),
format = "html"
)
```
Let's Start the plotting
### How do Expeditions End?
These figures shows that proportionally, accidents are more common with None-commercial expeditions. Termination due to bad weather appears uniform across all groups, which is expected.
::::: columns
::: {.column width="50%"}
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
ggplot(exped_q2, aes(
x = FUNDING_SIMPLIFIED,
fill = TERMINATION_TYPE
)) +
geom_bar(position = "fill") +
scale_fill_brewer(
palette = "YlGnBu",
direction = 1,
name = "Termination reason"
) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
x = NULL,
y = NULL,
fill = "Termination reason",
title = "How do expeditions end?",
subtitle = "By funding type, Himalayan expeditions 2020–2024",
caption = "Source: Himalayan Database via TidyTuesday (2025-01-21)"
) +
theme_minimal(base_size = 14) +
theme(
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6),
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(margin = margin(b = 6), size = 8),
plot.caption = element_text(color = "gray40", size = 6),
legend.position = "right",
panel.grid.major.x = element_blank()
)
```
:::
::: {.column width="50%"}
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
ggplot(exped_q2, aes(x = FUNDING_SIMPLIFIED, fill = OUTCOME_LABEL)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(palette = "YlGnBu", direction = 1) +
labs(
title = "Expedition Outcomes by Funding Source",
subtitle = "Independent groups have more fatal or failed climbs",
x = NULL,
y = "Proportion of Expeditions",
fill = "Outcome"
)
```
:::
:::::
:::::: columns
::: {.column width="25%"}
:::
::: {.column width="50%" style="\"min-height:100px;"}
```{r, fig.height=3}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
ggplot(exped_q2, aes(x = FUNDING_SIMPLIFIED, fill = OUTCOME_LABEL)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "YlGnBu", direction = 1) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
x = NULL,
y = "Proportion of expeditions",
title = "Distribution of Expedition Outcomes by Funding Type",
subtitle = "Includes success and death combinations for Himalayan expeditions (2020–2024)",
fill = "Outcome",
caption = "Source: Himalayan Database via TidyTuesday (2025-01-21)"
) +
theme_minimal(base_size = 14) +
theme(
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
legend.position = "right",
panel.grid.major.x = element_blank()
)
```
:::
::: {.column width="25%"}
<!-- blank right spacer -->
:::
::::::
### Fatality Trend
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
exped_q2 |>
group_by(YEAR, FUNDING_SIMPLIFIED) |>
summarise(fatality_rate = mean(FATALITIES, na.rm = TRUE), .groups = "drop") |>
ggplot(aes(x = YEAR, y = fatality_rate, color = FUNDING_SIMPLIFIED)) +
geom_line(size = 1.2) +
geom_smooth(method = "loess", se = FALSE, linewidth = 1, linetype = "dashed") +
scale_color_brewer(palette = "YlGnBu", direction = 1) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Trend in Member Fatalities by Funding Type",
subtitle = "Fatality rate = expeditions with ≥1 member or hired death",
x = "Year",
y = "Fatality Rate (%)",
color = "Funding Type",
caption = "Source: Himalayan Database via TidyTuesday (2025-01-21)"
) +
theme_minimal(base_size = 14)
```
It’s one thing to say commercial expeditions are safer but is that always true? I wanted to check whether fatality rates have changed over time, and if so, whether some groups are improving faster than others. This smoothed line chart shows the share of expeditions with fatalities by year and funding type. This is overlap with question 1.
### Team Size vs Fatality
Next question: does team size make a difference? You’d think larger teams might be safer due to support and redundancy, but also possibly riskier if they’re less experienced or move slowly. This scatterplot checks whether total team size correlates with success, failure, or fatalities — and whether that varies by funding type.
```{r}
#| label: plot-teamsize-vs-outcome
#| fig-cap: "Team Size and Expedition Outcome by Funding Type"
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
ggplot(exped_q2, aes(x = TOTMEMBERS, y = OUTCOME_LABEL, color = FUNDING_SIMPLIFIED)) +
geom_jitter(width = 0.3, height = 0.2, alpha = 0.6, size = 2) +
scale_color_brewer(palette = "YlGnBu", direction = 1) +
labs(
x = "Total team members",
y = "Outcome type",
title = "Team Size and Expedition Outcome by Funding Type",
subtitle = "Each point is one expedition (2020–2024)",
color = "Funding Type",
caption = "Source: Himalayan Database via TidyTuesday (2025-01-21)"
) +
theme_minimal(base_size = 14)
```
### Team Size by Funding As Proxy for Safety Resources
Large commercial teams often come with more resources: guides, oxygen, fixed ropes, medical support. That could partly explain why they have **lower fatality rates**.
Meanwhile, small non-commercial teams might:
- Take harder routes (fewer sherpa or porters)
- Push limits with less redundancy
- Face more exposure if things go wrong
So while team size alone doesn’t determine safety, it’s a **proxy** for resourcing — and resourcing links back to safety.
```{r}
#| code-fold: true
#| code-summary: "Code"
#| echo: true
#| results: hide
#| message: false
#| warning: false
library(ggridges)
exped_q2 |>
filter(TOTMEMBERS < 50) |> # Trim extreme outliers for better scaling
ggplot(aes(
x = TOTMEMBERS,
y = FUNDING_SIMPLIFIED,
fill = FUNDING_SIMPLIFIED
)) +
geom_density_ridges(
scale = 1.2,
rel_min_height = 0.01,
alpha = 0.6,
color = "white"
) +
scale_fill_brewer(palette = "YlGnBu", direction = 1) +
labs(
x = "Team Size (Total Members)",
y = "Funding Type",
title = "Distribution of Team Size by Funding Type",
subtitle = "Commercial expeditions tend to bring more people",
caption = "Source: Himalayan Database via TidyTuesday (2025-01-21)"
) +
theme_minimal(base_size = 14) +
theme(legend.position = "none")
```